

@vicLin8712 vicLin8712 commented Nov 18, 2025

Features of this design

  • Bitmap plus per-priority runnable-task ready queues: O(1) task selection
  • O(1) tracking of the task count in each ready queue
  • De Bruijn LUT for computing the highest-priority non-empty queue
  • Strict priority execution
  • Round-robin within each priority queue
  • Embedded list node for task state transitions
  • System idle task
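The bitmap-based selection named above can be sketched as follows. This is an illustrative use of the standard De Bruijn multiply-and-shift technique; the function and table names are placeholders rather than the kernel's actual identifiers, and it assumes bit 0 denotes the highest priority.

```c
#include <stdint.h>

/* 32-entry De Bruijn lookup table for constant-time bit-index
 * computation (standard multiply-and-shift technique). */
static const uint8_t debruijn_lut[32] = {
    0,  1,  28, 2,  29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4,  8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6,  11, 5,  10, 9,
};

/* Return the index of the lowest set bit in v (v must be nonzero).
 * With bit 0 as the highest priority, this yields the index of the
 * highest-priority non-empty ready queue in O(1). */
static inline uint8_t highest_ready_prio(uint32_t v)
{
    return debruijn_lut[((uint32_t) ((v & -v) * 0x077CB531U)) >> 27];
}
```

The multiply isolates the lowest set bit, maps it onto the top five bits, and the table translates that back to a bit index without any loop.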

Behavior test

Refer to commit 02564c9 in the forked repository's branch.

Bitmap & task-count consistency across task states

State group 1: {TASK_RUNNING, TASK_READY}

  • Task must be in a ready queue
  • Bitmap bit must be set

State group 2: {all other states}

  • Task must not be in any ready queue
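This invariant can be written down as a small checkable predicate; the state names and helper below are illustrative, and the kernel's actual enum may differ.

```c
#include <stdbool.h>

/* Illustrative task states (not the kernel's actual enum). */
typedef enum { TASK_READY, TASK_RUNNING, TASK_SUSPENDED, TASK_BLOCKED } state_t;

/* Invariant from the test plan: tasks in {RUNNING, READY} must sit in a
 * ready queue with their priority bit set; all other states must not be
 * in any ready queue (the bit may still be set by other tasks at that
 * priority, so it is not checked for group 2). */
static bool invariant_holds(state_t s, bool in_ready_queue, bool bit_set)
{
    bool runnable = (s == TASK_READY || s == TASK_RUNNING);
    return runnable ? (in_ready_queue && bit_set) : !in_ready_queue;
}
```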

Task state transitions

  • Creation (TASK_READY) – initial enqueue and priority bit set.

  • Priority change – priority migration updates the queue placement and the corresponding bitmap bit.

  • Suspension (TASK_READY → TASK_SUSPEND) – dequeued from the ready queue and priority bit cleared.

  • Resumption (TASK_SUSPEND → TASK_READY) – re-enqueued with correct priority placement.

  • Cancellation (TASK_READY → TASK_CANCELLED) – removed from the ready queues and all bitmap bits fully cleared.

  • Blocked-task behavior (TASK_RUNNING → TASK_BLOCKED)

    • The delay task is created and its priority is promoted to match the controller task's priority (TASK_READY).
    • After the controller yields, the delay task becomes the running task, invokes mo_task_delay(), and transitions to TASK_BLOCKED.
    • Once control returns to the controller, the task count and bitmap are verified.

Cursor consistency

States
a. If the task count >= 2, the cursor points to a task other than the current one; if they are the same, the cursor is advanced.
b. If the task count = 1, the cursor always points to the only task.
c. If the task count = 0, the cursor points to NULL.
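The three rules can be expressed as one predicate, handy in test assertions. This is a sketch with hypothetical names, not kernel code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Check the three cursor rules against a queue's task count, its RR
 * cursor, and the currently running task:
 *   count == 0 -> cursor is NULL                       (rule c)
 *   count == 1 -> cursor points at the only task       (rule b)
 *   count >= 2 -> cursor differs from the current task (rule a) */
static bool cursor_consistent(size_t count, const void *cursor,
                              const void *current)
{
    if (count == 0)
        return cursor == NULL;
    if (count == 1)
        return cursor != NULL;
    return cursor != NULL && cursor != current;
}
```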

Verified

  • New task created: task count 0 -> 1; cursor NULL -> new task.
  • The above task migrates priority to CRIT:
    • Check the CRIT queue cursor: from state b. (originally only the controller task) to a. (new task migrated from NORMAL).
    • Check the NORMAL queue cursor: from state b. (originally only the newly created task) to c. (migrated to CRIT queue).
  • Remove the new task: state a. to b.

Result

Linmo kernel is starting...
Heap initialized, 130000924 bytes available
idle id 1: entry=80002dcc stack=80005864 size=8192
task 2: entry=80003a4c stack=80007918 size=1024 prio_level=4 time_slice=5
PASS: Bitmap is consistent when priority migration
PASS: Task count is consistent when priority migration
PASS: Bitmap is consistent when TASK_SUSPENDED
PASS: Task count is consistent when TASK_SUSPENDED
PASS: Bitmap is consistent when TASK_READY from TASK_SUSPENDED
PASS: Task count is consistent when TASK_READY from TASK_SUSPENDED
PASS: Bitmap is consistent when task canceled
PASS: Task count is consistent when task canceled
task 5: entry=800001b8 stack=80009f08 size=8192 prio_level=4 time_slice=5
PASS: Task count is consistent when task canceled
PASS: Task count is consistent when task blocked
task 6: entry=800001a8 stack=80009f08 size=8192 prio_level=4 time_slice=5
PASS:  Cursor setup successful 
PASS:  Cursor advance successful when new task enqueue into one-task-existing ready queue. 
PASS:  Cursor set to NULL when no task. 
PASS:  Cursor set successful when cursor-pointed task remove; cursor advanced. 

=== Test Results ===
Tests passed: 16
Tests failed: 0
Total tests: 16
All tests PASSED!
RR-cursor based scheduler tests completed successfully.

Benchmark

Refer to commit 0fb44b8 in the forked repository's branch.

Approach

File descriptions

  • bench.py: Main controller; generates the scenarios, collects data, prints, and plots the results.
  • sched_cmp.c: Generates tasks and runs one scenario based on the TEST_SCENARIO flag passed from bench.py.
  • task.c: Adds the original scheduler, activated by the OLD flag from bench.py:
uint16_t sched_select_next_task(void)
{
#if OLD
    return sched_select_next_task_old();
#else
    return sched_select_next_task_new();
#endif
}
  • Statistics collection: insert a time recorder at the beginning and end of the scheduler code:
uint16_t sched_select_next_task_new(void)
{
    /* Record entry time */
    uint64_t u = _read_us();
    schedule_cnt++;
...
    /* Update statistics */
    each_schedule_time = _read_us() - u;
    schedule_time += each_schedule_time;

    if (kcb->task_current)
        return new_task->id;
...
}

Scenarios

static const struct {
    const char *name;
    uint32_t task_count;
    int task_active_ratio;
} perf_tests[] = {
    {"Minimal Active", 500, 2}, /* 2% tasks available */
    {"Moderate Active", 500, 4}, {"Heavy Active", 500, 20},
    {"Stress Test", 500, 50},    {"Full Load Test", 500, 100},
};

Experiment steps

  • Run each scenario for 40 s with the new scheduler; obtain the average and maximum scheduling time for each scenario.
  • Following the same steps, run the old scheduler via the OLD flag and compute the improvement as average_old / average_new.
  • Run each scenario 20 times and calculate the average improvement.
  • Run all scenarios.
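The improvement statistic reported below might be computed per run as the ratio of old to new scheduling time and then averaged, so values above 1.0 mean the new scheduler is faster. The helper below is a hypothetical sketch, not part of bench.py:

```c
#include <stddef.h>

/* Mean improvement across n runs: per-run ratio old/new, averaged.
 * old_us/new_us hold the measured scheduling times per run. */
static double mean_improvement(const double *old_us, const double *new_us,
                               size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += old_us[i] / new_us[i];
    return sum / (double) n;
}
```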

Result

Improvement of each scenario

Per-scenario statistics (improvement vs OLD):                                                                                                                                 
                                                                                                                                                                              
Scenario 'Minimal Active':                                                                                                                                                    
  mean improvement        = 3.78x faster                                                                                                                                      
  std dev of improvement  = 2.09x                                                                                                                                             
  min / max improvement   = 2.09x  /  11.76x                                                                                                                                  
  95% CI of improvement   = [2.87x, 4.70x]                                                                                                                                    
  mean old sched time     = 5358.8 us                                                                                                                                         
  mean new sched time     = 1426.25 us                                                                                                                                        
  max  old sched time     = 109.0 us                                                                                                                                          
  max  new sched time     = 18.0 us  

Scenario 'Moderate Active':                                                                                                                                                   
  mean improvement        = 2.63x faster                                                                                                                                      
  std dev of improvement  = 0.47x                                                                                                                                             
  min / max improvement   = 1.95x  /  3.89x                                                                                                                                   
  95% CI of improvement   = [2.42x, 2.83x]                                                                                                                                    
  mean old sched time     = 3099.5 us                                                                                                                                         
  mean new sched time     = 1186.3 us                                                                                                                                         
  max  old sched time     = 53.0 us                                                                                                                                           
  max  new sched time     = 18.0 us                                                                                                                                           
                                                                                                                                                                              
Scenario 'Heavy Active':                                                                                                                                                      
  mean improvement        = 1.26x faster                                                                                                                                      
  std dev of improvement  = 0.18x                                                                                                                                             
  min / max improvement   = 0.79x  /  1.44x                                                                                                                                   
  95% CI of improvement   = [1.18x, 1.34x]                                                                                                                                    
  mean old sched time     = 1567.25 us                                                                                                                                        
  mean new sched time     = 1274.7 us                                                                                                                                         
  max  old sched time     = 21.0 us                                                                                                                                           
  max  new sched time     = 18.0 us                                                                                                                                           
                                                                                                                                                                              
Scenario 'Stress Test':                                                                                                                                                       
  mean improvement        = 1.12x faster                                                                                                                                      
  std dev of improvement  = 0.08x                                                                                                                                             
  min / max improvement   = 0.93x  /  1.20x                                                                                                                                   
  95% CI of improvement   = [1.09x, 1.15x]                                                                                                                                    
  mean old sched time     = 1349.85 us                                                                                                                                        
  mean new sched time     = 1213.0 us                                                                                                                                         
  max  old sched time     = 31.0 us                                                                                                                                           
  max  new sched time     = 24.0 us                                                                                                                                           
                                                                                                                                                                              
Scenario 'Full Load Test':                                                                                                                                                    
  mean improvement        = 0.90x (slower than OLD)                                                                                                                           
  std dev of improvement  = 0.10x                                                                                                                                             
  min / max improvement   = 0.66x  /  1.06x                                                                                                                                   
  95% CI of improvement   = [0.86x, 0.95x]                                                                                                                                    
  mean old sched time     = 1239.65 us                                                                                                                                        
  mean new sched time     = 1380.55 us                                                                                                                                        
  max  old sched time     = 32.0 us                                                                                                                                           
  max  new sched time     = 64.0 us

Comparison of new and old scheduler
(figure omitted)

New scheduler average scheduling time
(figure omitted)

Note

This is the complete version; the draft PRs #23, #38, and #37 will be closed and no longer updated.


Summary by cubic

Switches the scheduler to O(1) selection using per‑priority ready queues and a ready bitmap with RR cursors for fairness. Adds a system idle task and integrates ready‑queue operations across task state transitions.

  • New Features
    • O(1) task pick: highest ready priority via bitmap (De Bruijn), advance per‑priority rr_cursor.
    • Added kcb_t fields (ready_bitmap, ready_queues[], queue_counts[], rr_cursors[]) and embedded rq_node in tcb_t; intrusive list helpers and bitmap ops.
    • Enqueue/dequeue now append/remove embedded nodes and keep counts, cursors, and bitmap in sync.
    • Introduced idle_task_init/sched_idle; boot starts in idle and switches there when queues are empty.
    • Refactored task APIs (spawn, delay, suspend/resume, cancel, block, semaphore/mutex wakeups) to use enqueue/dequeue; priority changes migrate ready queues and running tasks yield.

Written for commit 1d9d7a3. Summary will update automatically on new commits.

@vicLin8712 vicLin8712 changed the title Add foundational data structures and enqueue/dequeue API updates for the O(1) scheduler [1/4] O(1) scheduler: Introduce infrastructure Nov 18, 2025
@vicLin8712 vicLin8712 changed the title [1/4] O(1) scheduler: Introduce infrastructure [1/5] O(1) scheduler: Introduce infrastructure Nov 19, 2025
@vicLin8712 vicLin8712 changed the title [1/5] O(1) scheduler: Introduce infrastructure [1/3] O(1) scheduler: Introduce infrastructure Nov 19, 2025
@jserv jserv changed the title [1/3] O(1) scheduler: Introduce infrastructure O(1) scheduler: Introduce infrastructure Nov 19, 2025

jserv commented Nov 19, 2025

Do not include numbers in pull-request subjects.

@jserv jserv left a comment

Rebase the latest 'main' branch to resolve unintended CI/CD regressions.

/* Hart-Specific Data */
uint8_t hart_id; /* RISC-V hart identifier */

} sched_t;
Collaborator:

IIRC, I’ve asked several questions about per-hart data structures before, but I didn't get a definitive answer at the time. If we want to split out per-hart data structures to facilitate future SMP integration, how do we distinguish which queues should go into the per-hart structure and which should remain in the global one in a multi-hart scenario? You retained the sched_t design, but I'm still unclear if this is a reasonable approach.

Collaborator:

Additionally, the description for Patch 1 doesn't seem quite right to me. It states that sched_t is introduced to support the O(1) scheduler. However, IIUC, we could achieve the exact same behavior by placing everything in kcb_t if we disregard SMP support for now. Therefore, the existence of sched_t seems to be more about preparing for future SMP support rather than enabling the O(1) scheduler itself. Did I miss something? Is there actually a direct connection between the two?

Collaborator Author:

I agree with your point. I originally separated the scheduler-related data structures for readability, but it’s not essential for this PR.

I will embed the separated section back into kcb_t to keep this patch focused on the functional changes.

Thanks for your suggestions.

@visitorckw

I'm also a bit confused as to why the original PR was closed and split into three separate ones. It seems all three are related to O(1) scheduler support and are interdependent.

This actually makes the review process more difficult because the subsequent PRs include commits and changes from the previous ones. It also makes it much harder for me to track down previous discussions.

@vicLin8712
Collaborator Author

I'm also a bit confused as to why the original PR was closed and split into three separate ones. It seems all three are related to O(1) scheduler support and are interdependent.

This actually makes the review process more difficult because the subsequent PRs include commits and changes from the previous ones. It also makes it much harder for me to track down previous discussions.

Thanks for the feedback, and I understand the confusion.

The original PR contained too many changes at once — new data structures, new APIs, logic refactoring, and the introduction of the O(1) scheduler. I split the work into three PRs (data structures, task state transitions, and final scheduler activation) so the series would be easier to review. Each PR remains compilable and all applications in apps/ continue to run with the old scheduler.

However, I understand that the current split causes later PRs to include commits from earlier ones, which complicates the review process.

Could you share your thoughts on what approach would be most appropriate for this project? For example, would you prefer a single PR with a clean commit history, or a small series of self-contained PRs?

I’d be happy to reorganize the work to match the project’s preferred workflow.

@vicLin8712 vicLin8712 force-pushed the o1-sched-basic branch 5 times, most recently from 24c5367 to eaf8b0c Compare November 21, 2025 13:36
@vicLin8712 vicLin8712 changed the title O(1) scheduler: Introduce infrastructure O(1) scheduler Nov 21, 2025
@vicLin8712 vicLin8712 force-pushed the o1-sched-basic branch 2 times, most recently from e995be1 to 6f37b93 Compare November 21, 2025 14:23
@vicLin8712
Copy link
Collaborator Author

@jserv ,

The following atomic operation in mutex.c includes task state transition.

/* Atomic block operation with enhanced error checking */
static void mutex_block_atomic(list_t *waiters)
{
...
    /* Block and yield atomically */
    self->state = TASK_BLOCKED;
    _yield(); /* This releases NOSCHED when we context switch */
}

In this commit series, a dequeuing path has been added in _sched_block() so that it handles the RUNNING/READY → BLOCKED transition and the corresponding bitmap operation correctly. However, its queue_t parameter does not match the list_t used by mutex_block_atomic(), and the mutex does not need the timer work either.
Should I extend _sched_block() to support list-based waiters, or create a separate small helper for the mutex path?

This commit extends the core scheduler data structures to support
the new O(1) scheduler design.

Adds in tcb_t:

 - rq_node: embedded list node for ready-queue membership used
   during task state transitions. This avoids redundant malloc/free
   for per-enqueue/dequeue nodes by tying the node's lifetime to
   the task control block.

Adds in kcb_t:

 - ready_bitmap: 8-bit bitmap tracking which priority levels have
   runnable tasks.
 - ready_queues[]: per-priority ready queues for O(1) task
   selection.
 - queue_counts[]: per-priority runnable task counters used for
   bookkeeping and consistency checks.
 - rr_cursors[]: round-robin cursor per priority level to support
   fair selection within the same priority.

These additions are structural only and prepare the scheduler for
O(1) ready-queue operations; they do not change behavior yet.
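A rough sketch of how these fields might sit together is shown below. The layout and the 8-level assumption are illustrative, mirroring the 8-bit ready bitmap; this is not the actual kcb_t definition.

```c
#include <stdint.h>

#define SCHED_PRIO_LEVELS 8 /* assumed: matches the 8-bit ready bitmap */

/* Minimal stand-ins for the kernel's list types. */
typedef struct list_node {
    struct list_node *next;
    void *data;
} list_node_t;
typedef struct {
    list_node_t *head;
} list_t;

/* Sketch of the fields this commit adds to kcb_t: */
typedef struct {
    uint8_t ready_bitmap;                       /* bit i set => level i non-empty */
    list_t ready_queues[SCHED_PRIO_LEVELS];     /* per-priority ready queues */
    uint32_t queue_counts[SCHED_PRIO_LEVELS];   /* runnable tasks per level */
    list_node_t *rr_cursors[SCHED_PRIO_LEVELS]; /* RR cursor per level */
} kcb_sketch_t;
```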
When a task is enqueued into or dequeued from the ready queue, the
bitmap that indicates the ready-queue state must be updated.

These two helper functions can be used in related functions whenever
bitmap operations are required.
Previously, list_pushback() and list_remove() were the only list APIs
available for manipulating task lists. However, both functions perform
malloc/free internally, which is unnecessary and inefficient during
task state transitions.

The previous commit introduced an intrusive ready-queue membership
structure in tcb_t. To support this design and improve efficiency,
this commit adds two helper functions for intrusive list manipulation,
eliminating overhead malloc/free operation during task lifecycle.

 - list_pushback_node(): append an existing node to the end of the
   list without allocating memory.

 - list_remove_node(): remove a node from the list without freeing it.

Both helpers run in O(n) time because they locate their position by a
linear search.
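A minimal host-side sketch of the two helpers on a singly linked list follows. The kernel's list_t/list_node_t layout may differ; this version uses a head-only list for brevity, and both operations walk the list linearly, matching the O(n) bound above.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct list_node {
    struct list_node *next;
    void *data;
} list_node_t;
typedef struct {
    list_node_t *head;
} list_t;

/* Append an existing node; no allocation. O(n): walks to the tail. */
static void list_pushback_node(list_t *list, list_node_t *target)
{
    target->next = NULL;
    if (!list->head) {
        list->head = target;
        return;
    }
    list_node_t *it = list->head;
    while (it->next)
        it = it->next;
    it->next = target;
}

/* Unlink a node without freeing it. O(n): linear search for the node. */
static bool list_remove_node(list_t *list, list_node_t *target)
{
    list_node_t **pp = &list->head;
    while (*pp && *pp != target)
        pp = &(*pp)->next;
    if (!*pp)
        return false;
    *pp = target->next;
    target->next = NULL;
    return true;
}
```

Because the node is embedded in the task control block, its lifetime is tied to the task and no per-enqueue malloc/free is needed.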
This commit refactors mo_enqueue_task() to support the data structures
for the new scheduler.

The RR cursor must be advanced if it points to the same task as the
running one, which can only happen when the ready queue holds a single
task. The RR cursor will then point to the newly enqueued task to keep
the scheduler consistent.
This commit refactors mo_dequeue_task() to support data structures for
the new scheduler.

 - Set the RR cursor to NULL when no task remains in the ready queue.
 - Circularly advance the RR cursor if it currently points to the task
   being dequeued, so the cursor never references an unlinked node.
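The two rules can be sketched as a pure function over a circular queue. This is a hypothetical helper for illustration; the kernel manipulates the cursor in place.

```c
#include <stddef.h>

typedef struct node {
    struct node *next; /* circular: the last node links back to the first */
} node_t;

/* Given the current cursor, the node being removed, and the task count
 * after removal, return where the cursor should point:
 *  - queue empty       -> NULL
 *  - cursor == removed -> advance circularly past the unlinked node
 *  - otherwise         -> unchanged */
static node_t *cursor_after_dequeue(node_t *cursor, node_t *removed,
                                    size_t count_after)
{
    if (count_after == 0)
        return NULL;
    if (cursor == removed)
        return removed->next;
    return cursor;
}
```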
This commit refactors all task operation APIs that are related to task
state transitions to support the new scheduler. The simplified
mo_enqueue_task() and mo_dequeue_task() routines are now invoked
directly inside these operations.

Enqueue and dequeue actions are performed only when the state
transition crosses the following groups:

  {TASK_RUNNING, TASK_READY} ↔ {other states}
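The group-crossing rule can be sketched as a pair of predicates. The state enum below is illustrative; actual names and values may differ.

```c
#include <stdbool.h>

/* Illustrative task states (not the kernel's actual enum). */
typedef enum {
    TASK_STOPPED,
    TASK_READY,
    TASK_RUNNING,
    TASK_BLOCKED,
    TASK_SUSPENDED,
    TASK_CANCELLED
} state_t;

static bool in_ready_group(state_t s)
{
    return s == TASK_RUNNING || s == TASK_READY;
}

/* Enqueue only when entering {RUNNING, READY} from outside it. */
static bool needs_enqueue(state_t from, state_t to)
{
    return !in_ready_group(from) && in_ready_group(to);
}

/* Dequeue only when leaving {RUNNING, READY}. */
static bool needs_dequeue(state_t from, state_t to)
{
    return in_ready_group(from) && !in_ready_group(to);
}
```

Transitions inside a group (e.g. RUNNING → READY) touch neither the ready queues nor the bitmap.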

The sections below describe the detailed changes for each API:

 - sched_wakeup_task(): Add TASK_RUNNING as part of the state-group
   complement, preventing running tasks from being enqueued again.

 - mo_task_cancel(): Cancel all tasks except TASK_RUNNING. If the task
   is in TASK_READY, mo_dequeue_task() is invoked before cancellation.

 - mo_task_delay(): Transition from TASK_RUNNING to TASK_BLOCKED;
   call mo_dequeue_task() accordingly.

 - mo_task_suspend(): This API can be called for both TASK_RUNNING and
   TASK_READY tasks. Both conditions require invoking mo_dequeue_task()
   before transitioning to TASK_SUSPEND.

 - mo_task_resume(): Transition from TASK_SUSPEND to TASK_READY;
   call mo_enqueue_task().

 - _sched_block(): Invoked only when a TASK_RUNNING task calls mutex-
   related APIs; performs the TASK_RUNNING → TASK_BLOCKED transition.
The previous implementation compared kcb->task_current directly with
the task's list node, which became incorrect after introducing the
embedded ready-queue list-node structure. This commit updates the
condition to compare the underlying task object instead:

    kcb->task_current->data == task

This ensures mo_task_suspend() correctly detects when the suspended
task is the currently running one.
This commit adds the missing enqueuing path for awakened tasks during
semaphore signaling and mutex unlocking, ensuring that tasks are
correctly inserted into the ready queue under the new scheduler design.
@vicLin8712 vicLin8712 force-pushed the o1-sched-basic branch 2 times, most recently from 753ae21 to 1d9d7a3 Compare November 22, 2025 04:14
Previously, mutex_block_atomic() only updated the task state. Under the new
scheduler design, the blocked task must also be removed from the ready queue.

The existing helper _sched_block() does not match the mutex path because it
operates on queue_t instead of list_t and also processes deferred timer work,
which mutex locking does not require.

This commit introduces _sched_block_mutex(), a helper that supports the list-
based waiter structure and skips deferred timer handling. It will be used by
the mutex lock APIs in a later change.
This commit replaces mutex_block_atomic() with _sched_block_mutex() to align
mutex blocking behavior with the new scheduler design.

Blocked tasks are now properly dequeued from the ready queue, and no deferred
timer processing is performed.
This commit introduces a new helper, sched_migrate_task(), which migrates
a task between ready queues of different priority levels; it will be
used in mo_task_priority() for readability.
This commit introduces the sched_migrate_task() helper, which handles
migration of a task to the correct ready queue when its priority changes.
If the task is already in a ready queue, the helper dequeues it from the
old priority level, enqueues it into the new one, and updates all related
bookkeeping.

In addition, if a TASK_RUNNING task changes its priority, it now yields
immediately. This ensures that the scheduler always executes tasks in
strict priority order, preventing a running task from continuing to run
at an outdated priority level.
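The bookkeeping half of such a migration might look like the sketch below, which models only the counts and the bitmap (queues and cursors elided; all names are hypothetical):

```c
#include <stdint.h>

#define PRIO_LEVELS 8 /* assumed: matches the 8-bit ready bitmap */

/* Minimal bookkeeping model: counts and bitmap only. */
typedef struct {
    uint8_t ready_bitmap;
    uint32_t queue_counts[PRIO_LEVELS];
} rq_book_t;

/* Move one ready task from level 'from' to level 'to', clearing the
 * old priority bit when its queue empties and setting the new one when
 * its queue becomes non-empty. */
static void migrate_bookkeeping(rq_book_t *b, uint8_t from, uint8_t to)
{
    if (--b->queue_counts[from] == 0)
        b->ready_bitmap &= (uint8_t) ~(1u << from);
    if (b->queue_counts[to]++ == 0)
        b->ready_bitmap |= (uint8_t) (1u << to);
}
```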
This commit adds the system idle task and its initialization routine,
idle_task_init(). The idle task serves as the default execution context
when no runnable tasks exist. It never enters any ready queue and always
uses the fixed priority TASK_PRIO_IDLE.

Introducing a dedicated idle task ensures consistent scheduler entry
during startup, strict ordering for user tasks, and allows priority
adjustments before user tasks run for the first time.
This change sets up the scheduler state during system startup by
assigning kcb->task_current to kcb->harts->task_idle and dispatching
to the idle task as the first execution context.

This commit also keeps the scheduling entry path consistent between
startup and runtime.
This commit refactors mo_task_spawn() to align with the new O(1) scheduler
design. The task control block (tcb_t) embeds its list node during task
creation.

The enqueue operation is moved inside a critical section to guarantee
consistent enqueuing process during task creation.

The “first task assignment” logic is removed because the first task has
already been assigned to the system idle task, as mentioned in the
previous commit.
When all ready queues are empty, the scheduler should switch to idle
mode and wait for incoming interrupts. This commit introduces a
dedicated helper to handle that transition, centralizing the logic and
improving the readability of the scheduler's path to idle.
This commit introduces a 32-entry De Bruijn lookup table to support
constant-time bitmap index computation. This mechanism will be used in
later commits to replace iterative bit-scanning when selecting the next
runnable priority.

The helper itself does not change any scheduling behavior yet, but lays
the groundwork for the new O(1) scheduler’s priority computation path.
Previously, the scheduler performed an O(N) scan of the global task list
(kcb->tasks) to locate the next TASK_READY task. This resulted in
non-deterministic selection latency and unstable round-robin rotation
under heavy load or frequent task state transitions.

This change introduces a strict O(1) scheduler based on per-priority
ready queues and round-robin (RR) cursors. Each priority level maintains
its own ready queue and cursor, enabling constant-time selection of the
next runnable task while preserving fairness within the same priority.

Additionally, when all tasks are non-runnable, the scheduler now switches
directly to the system idle task after bitmap lookup, ensuring consistent
control flow and eliminating unnecessary scanning paths.
@vicLin8712 vicLin8712 marked this pull request as ready for review November 23, 2025 06:05
@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 6 files



<file name="kernel/task.c">

<violation number="1" location="kernel/task.c:914">
First task creation no longer initializes `kcb-&gt;task_current`, so the new scheduler immediately panics on startup because `sched_select_next_task` still requires a non-null current-task pointer.</violation>
</file>

<file name="include/lib/list.h">

<violation number="1" location="include/lib/list.h:106">
list_pushback_node dereferences target-&gt;next even though new rq_node instances never initialise that field, so the guard invokes undefined behaviour and can prevent the first enqueue.</violation>
</file>


tcb->rq_node.data = tcb;

/* Push node to ready queue */
sched_enqueue_task(tcb);
@cubic-dev-ai cubic-dev-ai bot Nov 23, 2025

First task creation no longer initializes kcb->task_current, so the new scheduler immediately panics on startup because sched_select_next_task still requires a non-null current-task pointer.


<file context>
@@ -770,8 +907,11 @@ int32_t mo_task_spawn(void *task_entry, uint16_t stack_size_req)
+    tcb-&gt;rq_node.data = tcb;
+
+    /* Push node to ready queue */
+    sched_enqueue_task(tcb);
 
     CRITICAL_LEAVE();
</file context>

/* Pushback list node into list */
static inline void list_pushback_node(list_t *list, list_node_t *target)
{
if (unlikely(!list || !target || target->next))
@cubic-dev-ai cubic-dev-ai bot Nov 23, 2025

list_pushback_node dereferences target->next even though new rq_node instances never initialise that field, so the guard invokes undefined behaviour and can prevent the first enqueue.


<file context>
@@ -100,6 +100,24 @@ static inline list_node_t *list_pushback(list_t *list, void *data)
+/* Pushback list node into list */
+static inline void list_pushback_node(list_t *list, list_node_t *target)
+{
+    if (unlikely(!list || !target || target-&gt;next))
+        return;
+
</file context>
